\(F\) is linear, so the model is linear in parameters. \(x\hat{\beta}\) is the linear prediction and is the quantity of interest, the conditional expected value of \(y\).
Now, consider a situation where \(y\) is binary, such that,
where \(y_i^*\) is what we wish we could measure (say, the probability \(y_i=1\)), though we can only measure \(y_i\). Now, \(y^*\) is going to be our principal quantity of interest.
\(F\) is a nonlinear function, so the model is nonlinear in parameters. \(x\hat{\beta}\) is the linear prediction, but it is not the quantity of interest. Instead, the quantity of interest is \(F(x\hat{\beta})\) which is equal to \(y_i^*\).
Important Concept
The difference between this model and the OLS linear model is simply that we must transform the linear prediction, \(x\hat{\beta}\), by \(F\) in order to produce predictions. Put differently, we want to map our linear prediction, \(x\beta\), onto \(y^*\).
Why leave the robust OLS model?
Why not use the linear model when \(y\) is binary?
because we fail to satisfy the OLS assumptions.
because the residuals are not Normal.
because \(y\) is not Normal.
because \(y\) is limited.
Limited \(y\) variables are \(y\)s where our measurement is limited by the realities of the world. Such variables are rarely normal, often not continuous, and often observable indicators of unobservable things - this is all true of binary variables.
Limited Dependent Variables
Why would we measure \(y_i\) rather than \(y_i^*\)?
Limited dependent variables are usually limited in the sense that we cannot observe the range of the variable or the characteristic of the variable we want to observe. We are limited to observing \(y_i\), and so must estimate \(y_i^*\).
unordered or nominal categorical variables: type of car you prefer: Honda, Toyota, Ford, Buick; policy choices; consumer choices.
ordered variables that take on few values: some survey responses.
discrete count variables: number of episodes of scarring torture in a country-year, 0, 1, 2, 3, …, \(\infty\).
time to failure; how long a civil war lasts; how long a patient survives disease; how long a leader survives in office.
Binary dependent variables
Generally, we conceive of a binary variable as being the observable manifestation of some underlying, latent, unobserved continuous variable.
If we could adequately observe (and measure) the underlying continuous variable, we’d use some form of OLS regression to analyze that variable.
Why not use OLS?
\[ \mathbf{y}=\mathbf{X \beta} + \mathbf{u} \]
where we are principally interested in the conditional expectation of \(y\), \(E(y_{i}|\mathbf{x_{i}})\) where we want to interpret that expectation as a conditional probability, \(Pr(y=1|\mathbf{x_{i}})\); we focus on the probability the outcome occurs (i.e., \(y\) is equal to one).
Linear Probability Model
The linear probability model (LPM) is the OLS linear regression with a binary dependent variable.
The main justification for the LPM is that OLS is unbiased (by Gauss-Markov). But \(\ldots\)
predictions are nonsensical (linear, unbounded, measures of \(\hat{y}\) rather than \(y^*\)).
disturbances are non-normal, and heteroskedastic.
relation or mapping of \(x\beta\) and \(y\) are the wrong functional form (linear).
Running example - Democratic Peace data
As a running example, I’ll use the Democratic Peace data to estimate logit and probit models. These come from Oneal and Russett’s (1997) well-known study in ISQ. The units are dyad-years; the \(y\) variable is the presence or absence of a militarized dispute, and the \(x\) variables include a measure of democracy (the lower of the two Polity scores in the dyad) and a set of controls.
Predictions out of bounds
code
dp <- read_dta("/Users/dave/Documents/teaching/501/2023/slides/L7_limiteddv/code/dp.dta")
m1 <- glm(dispute ~ border + deml + caprat + ally, family = binomial(link = "logit"), data = dp)
logitpreds <- predict(m1, type = "response")
mols <- lm(dispute ~ border + deml + caprat + ally, data = dp)
olspreds <- predict(mols)
df <- data.frame(logitpreds, olspreds, dispute = as.factor(dp$dispute))
ggplot(df, aes(x = logitpreds, y = olspreds, color = dispute)) +
  geom_point() +
  labs(title = "Predictions from Logit and OLS", x = "Logit Predictions", y = "OLS Predictions") +
  geom_hline(yintercept = 0) +
  theme_minimal() +
  annotate("text", x = .05, y = -.05, label = "2,147 Predictions out of bounds", color = "red")
code
ggplot(df, aes(x = olspreds)) +
  geom_density(alpha = .5) +
  labs(title = "Density of OLS Predictions", x = "Predictions", y = "Density") +
  theme_minimal() +
  geom_vline(xintercept = 0, linetype = "dashed")
In the linear model, \(\hat{y_i}=x_i\beta\). This makes sense because \(y = y^*\). Put differently, \(y\) is continuous, unbounded, (assumed) normal, and is an “unlimited” measure of the concept we intend to measure.
In binary models, \(y \neq y^*\), because our observation of \(y\) is limited such that we can only observe its presence or absence. We have two different realizations of the same variable: \(y\) is the limited but observed variable; \(y^*\) is the unlimited variable we want to measure, but cannot because it is unobservable.
The goal of these models is to use \(y\) in the regression in order to get estimates of \(y^*\). Those estimates of \(y^*\) are our principal quantity of interest in the binary variable model.
Linking \(x\widehat{\beta}\) and \(y^*\)
We can produce the linear prediction, \(x\widehat{\beta}\), but we need to transform it to produce estimates of \(y^*\). To do so, we use a link function to map \(x_i\beta\) onto the probability space, \(y^*\). This means \(\widehat{y_i} \neq x\widehat{\beta}\). Instead,
\[y^* = F(x_i\beta)\]
Where \(F\) is a continuous, sigmoid probability CDF. This is how we get estimates of our quantity of interest, \(y^*\).
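The mapping can be made concrete with a short sketch. These Python helpers are illustrative (the course code uses R); they show that two common choices of \(F\), the logistic and standard normal CDFs, pull any linear prediction into (0, 1):

```python
import math

def logit_link(xb):
    """Logistic CDF: maps a real-valued linear prediction into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-xb))

def probit_link(xb):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(xb / math.sqrt(2.0)))

# Any linear prediction, large or small, maps into the unit interval.
for xb in (-4.0, -1.0, 0.0, 1.0, 4.0):
    assert 0.0 < logit_link(xb) < 1.0
    assert 0.0 < probit_link(xb) < 1.0
```

Both functions are sigmoid in shape; they differ only slightly in their tails, which is why logit and probit estimates usually tell the same substantive story.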
Nonlinear change in \(Pr(y=1)\) across values of \(x\)
In the LPM, the relationship between \(Pr(y=1)\) and \(X\) is linear, so the rate of change toward \(Pr(y=1)\) is constant across all values of \(X\).
This means that the rate of change approaching one (or approaching zero) is exactly the same as the rate of change anywhere else in the distribution.
For example, this means that the change from .99 to 1.00 is just as likely as the change from .50 to .51; is this sensible for a bounded latent variable (probability)?
The residuals are not normally distributed (except asymptotically). Suppose \[\begin{aligned}
y_{i}=1;~~~ u_{i} = 1-x_{i} \hat{\beta} \nonumber \\
y_{i}=0;~~~ u_{i} = -x_{i} \hat{\beta} \nonumber
\end{aligned}\]
The disturbance term, \(u_{i}\) only takes on two values (just like \(y_{i}\)); it follows the binomial rather than the normal distribution. This is not necessarily that serious a problem since the OLS estimates will still be unbiased.
The disturbances are heteroskedastic. Because the conditional expected value of \(Y\) equals the conditional probability that \(y=1\) (\(E(y_{i}|\mathbf{x_{i}})=Pr(y=1|\mathbf{x_{i}})\)), the variance of the disturbance term, \(u\), is \[\begin{aligned}
var(u_{i}) = E(Y_{i}|X_{i})[1-E(Y_{i}|X_{i})] \nonumber\\
=p_{i}(1-p_{i}) \nonumber
\end{aligned}\]
So the variance of the disturbance term depends explicitly on the conditional expectation of \(Y\), which is conditional on \(X\). Put another way, the variance of \(u\) depends on the independent variables, so it is neither homoskedastic nor independent of the \(X\)s, nor is \(var(u_{i})\) likely equal to \(var(u_{j})\).
Predictions
The predictions of \(Y\) (the conditional expectation \(E(y_{i}|\mathbf{x_{i}})\)) are not necessarily bounded by zero and one: \(0 \leq E(y_{i}|\mathbf{x_{i}}) \leq 1\) is not always fulfilled.
Linearity doesn’t seem like the right functional form.
Examples: This week’s video runs through some examples of these problems.
Why move to ML? Lipstick on a pig …
OLS is a rockin’ estimator, but it’s just not well suited to limited \(y\) variables. Efforts to rehabilitate the LPM are like putting lipstick on a pig.
Deriving an LLF from the ground up
So let’s build a model for a binary \(y\) variable.
Observe \(y\), consider its distribution, write the PDF.
Write the joint probability of the data, using the chosen probability distribution.
Write the joint probability as a likelihood.
Simplify - take logs, etc.
Parameterize
Write in the link function, linking the systematic component of the model to the latent variable, \(\tilde{y}\).
A nonlinear model for binary data
So \(y\) is binary, and we’ve established the linear model is not appropriate. The observed variable, \(y\), appears to be Bernoulli (iid):
\[ Pr(y_i) = p_i^{y_i}(1-p_i)^{1-y_i} \nonumber \]
Probit - link between \(x\hat{\beta}\) and \(Pr(y=1)\) is standard normal CDF: \[
\ln \mathcal{L} (Y|\beta) = \sum_{i=1}^{N} y_i \ln \Phi(\mathbf{x_i \beta})+ (1-y_i) \ln[1-\Phi(\mathbf{x_i \beta})] \nonumber
\]
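The log-likelihood above is easy to compute directly. A minimal Python sketch (the toy data and helper names are mine, not the Democratic Peace sample):

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_loglik(beta, X, y):
    """Sum over i of y_i*ln Phi(x_i'b) + (1 - y_i)*ln[1 - Phi(x_i'b)]."""
    ll = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, xi))
        p = norm_cdf(xb)
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll

# Toy data: an intercept and one covariate.
X = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]
ll = probit_loglik((0.0, 1.0), X, y)
print(round(ll, 4))
```

ML estimation simply searches for the \(\beta\) vector that maximizes this sum; in R, `glm(..., family = binomial(link = "probit"))` does that search for you.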
Sigmoid Functions
There are many sigmoid-shaped probability functions that will satisfy these needs.
On Linearity
In the linear model, \(\hat{y_i}=x_i\beta\). This makes sense because \(y = \tilde{y}\). Put differently, \(y\) is continuous, unbounded, (assumed) normal, and is an unlimited measure of the concept we intend to measure. Also, \(x_i\beta\) is in units of \(y\), so no mapping is necessary.
In binary models, \(y \neq \tilde{y}\), because our observation of \(y\) is limited such that we can only observe its presence or absence. We have two different realizations of the same variable: \(y\) is the limited but observed variable; \(\tilde{y}\) is the unlimited variable we want to measure, but cannot because it is unobservable.
The goal of these models is to use \(y\) in the regression in order to get estimates of \(\tilde{y}\). Those estimates are our principal quantity of interest in the binary variable model.
Since \(y \neq \tilde{y}\), we use the link function to map \(x_i\beta\) onto the space of \(\tilde{y}\). This means \(\hat{y_i} \neq x_i\beta\). Instead,
\[
\tilde{y} = F(x_i\beta) \nonumber
\] Thus, we get estimates of our quantity of interest, \(\tilde{y}\).
Non-constant change in \(Pr(y=1)\) across values of \(z\)
Sigmoid link functions
return probabilities in the appropriate bounds, mapping \(x\beta\) onto \(\tilde{y}\).
permit different rates of change in \(\tilde{y}\) across \(x\). The marginal effect of \(x\) is not \(\widehat{\beta}\) (more on this shortly).
“compress” effects of extreme values of \(x\) on \(Pr(y=1)\).
suggest questions about symmetry and transition probabilities (i.e. is .5 always a sensible transition from the probability of zero to probability of one?).
Binary response interpretation
Signs and significance - all the usual rules apply.
Quantities of interest - most commonly \(Pr(y=1|X)\); or marginal effects.
Measures of uncertainty (e.g. confidence intervals) are a must (as always).
Nonlinear models: Predicted probabilities
In the nonlinear model, the most basic quantity is
\[F(x\widehat{\beta})\]
where \(F\) is the link function, mapping the linear prediction onto the prediction space.
In the linear model, the marginal effect of \(x\) is \(\widehat{\beta}\). That is, the effect of a one unit change in \(x\) on \(y\) is \(\widehat{\beta}\).
The marginal effect is constant with respect to \(x_k\).
Marginal Effects, Nonlinear Model
In the nonlinear model, the marginal effect of \(x_k\) depends on where \(x\widehat{\beta}\) lies with respect to the probability distribution \(F(\cdot)\).
This is \(\widehat{\beta}\) weighted by or measured at the ordinate on the PDF - the ordinate is the height of the PDF associated with a value of the \(x\) axis (an abscissa).
Recall that \(\Lambda\) is the logit CDF, \(\Lambda(x_i\widehat{\beta}) = 1/(1+exp(-x_i\widehat{\beta}))\), and \(\lambda\) is the logit PDF, \(\lambda(x_i\widehat{\beta}) = exp(-x_i\widehat{\beta})/(1+exp(-x_i\widehat{\beta}))^2 = \Lambda(1-\Lambda)\).
The identity \(\lambda = \Lambda(1-\Lambda)\) means the marginal effect is \(\widehat{\beta}\) weighted by the probability that \(y=1\) times the probability that \(y=0\). Since the largest value this product can take is \(0.5 \times 0.5 = 0.25\) (when \(Pr(y_i=1)=Pr(y_i=0)=0.5\)), the maximum marginal effect is \(0.25 \widehat{\beta}\).
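A quick numeric check (in Python, for illustration) that the logistic density peaks at 0.25, which is what bounds the logit marginal effect at \(0.25\widehat{\beta}\):

```python
import math

def logis_pdf(z):
    """Logistic density: exp(-z) / (1 + exp(-z))^2 = Lambda(z) * (1 - Lambda(z))."""
    e = math.exp(-z)
    return e / (1.0 + e) ** 2

# Scan a grid of z values; the density peaks at z = 0 with height 0.25.
peak = max(logis_pdf(z / 100.0) for z in range(-500, 501))
print(round(peak, 4))
```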
Visualizing Logit Marginal Effects
code
z <- seq(-5, 5, .1)
p <- plogis(z)
d <- dlogis(z)
df <- data.frame(z = z, p = p, d = d)
# plot PDF and CDF with reference line at y = .25
ggplot(df, aes(x = z)) +
  geom_line(aes(y = d), color = "black") +
  geom_line(aes(y = p), color = "red") +
  geom_hline(yintercept = .25, linetype = "dashed") +
  labs(title = "Logistic PDF and CDF", x = "z", y = "F(z)") +
  theme_minimal()
The ordinate at the maximum of the standard normal PDF is 0.3989 - rounding to 0.4, we can say that the maximum marginal effect of any \(\widehat{\beta}\) in the probit model is \(0.4\widehat{\beta}\).
The ordinate is at its maximum where \(z=0\); recall this is the standard normal, so \(x_i\widehat{\beta}=z\). When \(z=0\), \(\phi(0)=0.3989\).
z <- seq(-5, 5, .1)
p <- pnorm(z)
d <- dnorm(z)
df <- data.frame(z = z, p = p, d = d)
# plot PDF and CDF with reference line at y = .3989
ggplot(df, aes(x = z)) +
  geom_line(aes(y = d), color = "black") +
  geom_line(aes(y = p), color = "red") +
  geom_hline(yintercept = .3989, linetype = "dashed") +
  labs(title = "Standard Normal PDF and CDF", x = "z", y = "F(z)") +
  theme_minimal()
Marginal Effects in the Nonlinear Model
code
z <- seq(-5, 5, .1)
ncdf <- pnorm(z)
npdf <- dnorm(z)
lcdf <- plogis(z)
lpdf <- dlogis(z)
df <- data.frame(ncdf = ncdf, npdf = npdf, lcdf = lcdf, lpdf = lpdf, z = z)
highchart() %>%
  hc_add_series(df, "line", hcaes(x = z, y = npdf)) %>%
  hc_add_series(df, "line", hcaes(x = z, y = ncdf)) %>%
  hc_add_series(df, "line", hcaes(x = z, y = lpdf)) %>%
  hc_add_series(df, "line", hcaes(x = z, y = lcdf)) %>%
  hc_xAxis(title = list(text = "z")) %>%
  hc_yAxis(title = list(text = "F(z)")) %>%
  hc_add_theme(hc_theme_flat()) %>%
  hc_legend(enabled = FALSE)
Logit Odds Interpretation
The odds are given by the probability an event occurs divided by the probability it does not:
\[ \text{odds} = \frac{Pr(y=1)}{1-Pr(y=1)} = \frac{\Lambda(x\widehat{\beta})}{1-\Lambda(x\widehat{\beta})} = exp(x\widehat{\beta}) \]
Not only is it simple to exponentiate \(\widehat{\beta_k}\), but the interpretation is that a one unit change in \(x\) multiplies the odds that \(y=1\) by the factor \(exp(\widehat{\beta_k})\), and more usefully, that:
\[
100*(exp(\widehat{\beta_k})-1) \nonumber
\]
is the percentage change in the odds given a one unit change in \(x_k\).
So a logit coefficient of .226
\[
100*(exp(.226)-1) =25.36 \nonumber
\]
produces a 25.36% increase in the odds of \(y\) occurring.
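A throwaway check of the arithmetic:

```python
import math

beta_hat = 0.226  # the logit coefficient from the example above
pct_change = 100.0 * (math.exp(beta_hat) - 1.0)
print(round(pct_change, 2))
```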
Interpreting Binary Models
Probit and logit coefficients are directly interpretable in the senses that
We can interpret direction.
We can interpret statistical difference from zero.
We can say the largest marginal effect of \(x \approx 0.4\cdot\widehat{\beta}\) for the probit model.
We can say the largest marginal effect of \(x \approx 0.25\cdot\widehat{\beta}\) for the logit model.
We can say that \(100 \cdot (exp(\widehat{\beta_k})-1)\) is the percentage change in the odds that \(y=1\), for the logit model.
It’s still the case that we often want other quantities of interest like probabilities, and that requires the straightforward transformation of the linear prediction, \(F(x_i\widehat{\beta})\).
Two general types of predictions
MEM - Marginal Effects at Means - set variables to means/medians/modes, vary \(x\) of interest, generate effect.
AME - Average Marginal Effects - set \(x\) of interest to value of interest, predict, average, then repeat at next value of interest.
Marginal Effects at Means (MEM)
MEMs are what they sound like - effects with independent variables set at central tendencies.
estimate the model.
create out-of-sample data: vary \(x\) of interest; set all other \(x\) variables to appropriate central tendencies - hence the “at Means.”
generate QIs in the out-of-sample data.
Average Marginal Effects (AME)
Average Marginal Effects are in-sample but create a counterfactual for a variable of interest, assuming the entire sample looks like that case.
For instance, suppose a model of wages with covariates for education and gender. We might ask: what would the predictions look like if the entire sample were male, but otherwise looked as it does? Alternatively, what would they look like if the entire sample were female, with all other variables as they appear in the estimation data?
To answer these, we’d change the gender variable to male, generate \(x{\widehat{\beta}}\) for the entire sample, and take the average, then repeat with the gender variable set to female.
Average Marginal Effects (AME)
estimate model.
in estimation data, set variable of interest to a particular value for the entire estimation sample.
generate QIs (expected values, standard errors).
take average of QIs, and save.
repeat for all values of variable of interest, and plot.
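The AME steps above can be sketched as follows; the coefficients and the tiny estimation sample here are invented for illustration, not estimates from the Democratic Peace data:

```python
import math

def logit_p(xb):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-xb))

# Hypothetical fitted coefficients: intercept, x1 (the variable of interest), x2.
beta = (-1.0, 0.8, 0.5)

# A tiny stand-in estimation sample (intercept included in each row).
sample = [(1.0, 0.0, 1.2), (1.0, 1.0, -0.4), (1.0, 2.0, 0.3)]

def ame_at(value, var_index=1):
    """Set the variable of interest to `value` for every row, predict, average."""
    preds = []
    for row in sample:
        row = list(row)
        row[var_index] = value
        xb = sum(b * x for b, x in zip(beta, row))
        preds.append(logit_p(xb))
    return sum(preds) / len(preds)

# Repeat at each value of interest; difference or plot these in practice.
for v in (0.0, 1.0, 2.0):
    print(v, round(ame_at(v), 3))
```

Note the key design choice: every other covariate keeps its observed value, so the average reflects the sample's actual covariate distribution rather than a single "average" case.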
Methods for Quantities of Interest
direct computation - generate \(F(x\widehat{\beta})\) for interesting values of \(x\) (either as MEM or AME).
simulation of \(\widehat{\beta}\).
simulation of QI.
Uncertainty
We have two main quantities of interest - everything so far has focused on generating a predicted value from the model. Let’s think about generating measures of uncertainty for those predicted values.
This section examines ways to compute standard errors, and ways to use those to compute confidence intervals.
Uncertainty: Standard Errors of Linear Predictions
Consider the linear prediction
\[X \widehat{\beta} \]
under maximum likelihood theory:
\[var(X \widehat{\beta}) = \mathbf{X V X'} \]
an \(N \times N\) matrix, where \(V\) is the var-cov matrix of \({\widehat{\beta}}\). The main diagonal contains the variances of the \(N\) predictions. The standard errors are:
\[se(X \widehat{\beta}) = \sqrt{diag(\mathbf{X V X'})} \]
which is an \(N \times 1\) vector.
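A pure-Python sketch of the same computation, with a toy \(X\) and \(V\) invented for illustration; the standard error of each prediction is \(\sqrt{x' V x}\):

```python
import math

# Toy values: three covariate rows and a 2x2 var-cov matrix of beta-hat.
X = [(1.0, 0.5), (1.0, 1.5), (1.0, 2.5)]
V = [(0.04, -0.01), (-0.01, 0.02)]

def pred_se(x, V):
    """sqrt(x' V x): the standard error of the linear prediction x'beta-hat."""
    k = len(x)
    var = sum(x[i] * V[i][j] * x[j] for i in range(k) for j in range(k))
    return math.sqrt(var)

ses = [pred_se(x, V) for x in X]  # the N x 1 vector of standard errors
print([round(s, 4) for s in ses])
```

In R, `predict(model, se.fit = TRUE)` returns these same quantities without forming the full \(N \times N\) matrix.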
Uncertainty: Delta Method
The ML method is appropriate for monotonic functions of \(X \widehat{\beta}\), e.g. logit, probit. In other models (e.g., multinomial logit), the function is not monotonic in \(X \widehat{\beta}\) so we use the Delta Method - this creates a linear approximation of the function. Greene (2012: 693ff) gives this as a general derivation of the variance:
\[Var[F(X \widehat{\beta})] = f(\mathbf{x'\widehat{\beta}})^2 \mathbf{x' V x} \]
Where this would generate variances for whatever \(F(X \widehat{\beta})\) is, perhaps a predicted probability.
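For logit, the density is \(f = F(1-F)\), so Greene's formula specializes neatly. A minimal sketch for the scalar case, with made-up numbers:

```python
import math

def logis_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_var_logit(xb, var_xb):
    """Delta-method variance of p = F(xb) in logit: [F(1-F)]^2 * var(xb)."""
    p = logis_cdf(xb)
    f = p * (1.0 - p)  # logistic density evaluated at xb
    return (f ** 2) * var_xb

# Example: a linear prediction of 0 with variance 0.04 gives (0.25^2)(0.04).
print(delta_var_logit(0.0, 0.04))
```

Because the density shrinks in the tails, delta-method variances are smallest exactly where predicted probabilities approach 0 or 1.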
Uncertainty: Standard Errors of \(p\) in Logit
By the delta method, the variance of the predicted probability in the logit model is:
\[[F(X \widehat{\beta})(1-F(X \widehat{\beta}))]^2 \mathbf{X V X'}\]
since the logistic PDF is \(f = F(1-F)\); the standard errors are the square roots of the main diagonal.
estimate the model, then generate the linear prediction and its standard error using either the ML or delta method.
generate linear boundary predictions, \(x{\widehat{\beta}} \pm c \cdot \text{st. err.}\), where \(c\) is a critical value on the normal, e.g. \(z=1.96\).
transform the linear prediction and the upper and lower boundary predictions by \(F(\cdot)\).
With ML standard errors, the end-point transformed (EPT) boundaries will obey distributional boundaries (i.e., they won’t fall outside the 0-1 interval for probabilities); the end-point predictions are symmetric on the linear scale, but will not be symmetric after the nonlinear transformation.
With delta standard errors, bounds may not obey distributional boundaries.
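The end-point transformation in Python, with a made-up linear prediction and standard error:

```python
import math

def logis_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

xb, se = 1.2, 0.6  # a linear prediction and its standard error (made-up values)
z = 1.96

# Transform the linear end points through F; bounds stay inside (0, 1).
lo = logis_cdf(xb - z * se)
p = logis_cdf(xb)
hi = logis_cdf(xb + z * se)
print(round(lo, 3), round(p, 3), round(hi, 3))
```

The interval is symmetric around \(x\widehat{\beta}\) on the linear scale but visibly asymmetric around \(p\) on the probability scale, exactly as described above.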
Uncertainty: Simulating confidence intervals, I
draw a sample with replacement of size \(\tilde{N}\) from the estimation sample.
estimate the model parameters in that bootstrap sample.
using the bootstrap estimates, generate quantities of interest (e.g., \(x\widehat{\beta}\)); repeat \(j\) times.
collect all these bootstrap QIs and use either percentiles or standard deviations to measure uncertainty.
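The bootstrap steps, sketched with a deliberately trivial “model” (the sample mean) so the resampling logic stands alone:

```python
import random
import statistics

random.seed(42)
data = [random.gauss(0.5, 1.0) for _ in range(200)]  # stand-in estimation sample

def bootstrap_ci(data, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI for the sample mean (a stand-in for any model QI)."""
    qis = []
    for _ in range(n_boot):
        resample = [random.choice(data) for _ in data]  # draw with replacement
        qis.append(statistics.mean(resample))           # re-estimate, save the QI
    qis.sort()
    lo = qis[int(n_boot * alpha / 2)]
    hi = qis[int(n_boot * (1.0 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci(data)
print(round(lo, 3), round(hi, 3))
```

For a real logit or probit, the `statistics.mean` line would be replaced by refitting the model on the resample and computing \(F(x\widehat{\beta})\).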
Uncertainty: Simulating confidence intervals, II
estimate the model.
generate a large sample of parameter vectors from \(N(\widehat{\beta}, V)\) (e.g., using Stata’s drawnorm).
generate quantities of interest for the distribution of parameters.
use either percentiles or standard deviations of the QI to measure uncertainty.
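The same recipe sketched in Python (drawnorm is Stata’s; here the multivariate normal draw is hand-rolled with a 2x2 Cholesky factor, and all numbers are invented for illustration):

```python
import math
import random

random.seed(7)

# Hypothetical estimates: beta-hat and its 2x2 var-cov matrix V.
beta = (-0.5, 0.8)
V = ((0.04, 0.01), (0.01, 0.09))

# Cholesky factor of V by hand (2x2 case): V = L L'.
l11 = math.sqrt(V[0][0])
l21 = V[1][0] / l11
l22 = math.sqrt(V[1][1] - l21 ** 2)

def draw_beta():
    """One draw from N(beta, V) using the Cholesky factor."""
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return (beta[0] + l11 * z1, beta[1] + l21 * z1 + l22 * z2)

def logis_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

# Push each parameter draw through F at a covariate profile x = (1, 1.5),
# then read percentiles of the simulated QIs as the confidence interval.
x = (1.0, 1.5)
qis = sorted(logis_cdf(b0 * x[0] + b1 * x[1])
             for b0, b1 in (draw_beta() for _ in range(5000)))
lo, hi = qis[int(0.025 * len(qis))], qis[int(0.975 * len(qis)) - 1]
print(round(lo, 3), round(hi, 3))
```

Unlike the bootstrap, this approach never re-estimates the model; it only requires \(\widehat{\beta}\) and \(V\) from a single fit.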
Oneal, John R., and Bruce M. Russett. 1997. “The Classic Liberals Were Right: Democracy, Interdependence, and Conflict, 1950-1985.” International Studies Quarterly 41 (2): 267–94.